To keep the files organized, we will create two subdirectories in our project directory:
“refgenome” where we will store the reference genome and “gtf” where we will save the
GTF annotation file.
The sequences of reference genomes and their annotations are available from many sequence
databases, such as Ensembl, UCSC, and the NCBI Genome database. iGenomes, built by Illumina,
facilitates downloading the reference data for frequently analyzed organisms: it provides
genome builds in FASTA format and their annotations in GTF/GFF files from the above major
databases. The iGenomes website with the download links is available at
“https://support.illumina.com/sequencing/sequencing_software/igenome.html”.
Reference data can also be downloaded from
“https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/”. For aligning with STAR, we will
download the UCSC human reference genome sequence in FASTA format and the gene annotation
in a GTF file because, in these files, the chromosomes are indicated by names rather than
accession numbers. While you are in the main directory “rnaseq”, run the following bash
script to create the subdirectories and to download the human reference genome and its
gene annotation:
mkdir refgenome
wget \
  -O "refgenome/hg38.fa.gz" \
  "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
gzip -d refgenome/hg38.fa.gz
mkdir gtf
wget \
  -O "gtf/hg38.ncbiRefSeq.gtf.gz" \
  "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz"
gzip -d gtf/hg38.ncbiRefSeq.gtf.gz
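STAR expects the chromosome names in the annotation to match those in the genome sequence, so it can be worth confirming this before building the index. The following is a minimal sketch rather than part of the workflow itself: the function names "fasta_seqnames", "gtf_seqnames", and "missing_in_fasta" are hypothetical helpers introduced for illustration, and the paths assume the files downloaded above.

```shell
# Print the sequence names of a FASTA file
# (the text after ">" up to the first whitespace), sorted and unique.
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print the sequence names found in column 1 of a GTF file,
# skipping comment lines and empty lines.
gtf_seqnames() {
    awk -F'\t' '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# Print the names that occur in the GTF (second argument)
# but not in the FASTA (first argument).
missing_in_fasta() {
    tmp=$(mktemp)
    fasta_seqnames "$1" > "$tmp"
    # -F fixed strings, -x whole-line match, -v keep only non-matching names
    gtf_seqnames "$2" | grep -vxF -f "$tmp"
    rm -f "$tmp"
}

# Example call; runs only if both downloaded files are present.
if [ -f refgenome/hg38.fa ] && [ -f gtf/hg38.ncbiRefSeq.gtf ]; then
    missing_in_fasta refgenome/hg38.fa gtf/hg38.ncbiRefSeq.gtf
fi
```

Any name printed by the last command appears in the annotation but has no corresponding sequence in the FASTA file and would need attention before indexing.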
Mapping reads to a reference genome using STAR is a two-step process: creating an index
of the reference sequence and then mapping the reads to that reference sequence.
The following command creates the STAR index for the reference genome sequence.
The “--runThreadN” option specifies the number of threads to use, “--genomeDir” specifies
the directory where the index files will be saved, “--genomeFastaFiles” and “--sjdbGTFfile”
specify the paths of the reference genome file and the annotation file, respectively, and
“--sjdbOverhang” specifies the maximum read length minus one; the value of 150 used here
therefore suits reads of 151 bases and should be adjusted to match your data.
mkdir indexes
STAR --runThreadN 4 \
--runMode genomeGenerate \
--genomeDir indexes \
--genomeFastaFiles refgenome/hg38.fa \
--sjdbGTFfile gtf/hg38.ncbiRefSeq.gtf \
--sjdbOverhang 150
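Since the “--sjdbOverhang” value depends on the read length, it may help to measure the read length directly from the data. The sketch below assumes gzip-compressed FASTQ files with four lines per record; the function name "max_read_length" and the file path "data/sample_R1.fastq.gz" are hypothetical and should be replaced with your own.

```shell
# Print the maximum read length found in a gzipped FASTQ file.
# In a four-line FASTQ record, the sequence is the 2nd line.
max_read_length() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max + 0 }'
}

# Hypothetical input file; replace with one of your own FASTQ files.
fastq="data/sample_R1.fastq.gz"
if [ -f "$fastq" ]; then
    # The suggested overhang is the maximum read length minus one.
    overhang=$(( $(max_read_length "$fastq") - 1 ))
    echo "Suggested --sjdbOverhang: $overhang"
fi
```

For paired-end data, the same check can be run on both read files and the larger value used.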